Add schema-aware intelligent column mapping pipeline by arsh0198 · Pull Request #2 · sbabyanusha/cBioAbstractor

arsh0198 · 2026-05-18T18:42:51Z

Summary

This PR extends the supplemental data formatting workflow by introducing a schema-aware column mapping pipeline for heterogeneous supplemental datasets.

Added Features

Intelligent column normalization
Schema-aware fuzzy column mapping using RapidFuzz
Automatic schema detection logic
Validation for required schema fields
Modularized formatting pipeline architecture
CLI-based processing workflow
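
The schema detection and required-field validation listed above could look roughly like the sketch below. It is not taken from the PR: the `Schema` class, the two example schemas, their field lists, and the coverage-based detection rule are all hypothetical and stand in for whatever `schemas.py` and `validator.py` actually define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical schema description: a name plus required/optional fields."""
    name: str
    required: frozenset
    optional: frozenset = frozenset()


# Illustrative schemas only; the real field lists live in the PR's schemas.py.
SCHEMAS = [
    Schema("mutation",
           frozenset({"SAMPLE_ID", "HUGO_SYMBOL"}),
           frozenset({"PROTEIN_CHANGE"})),
    Schema("clinical_sample",
           frozenset({"SAMPLE_ID", "PATIENT_ID"}),
           frozenset({"CANCER_TYPE"})),
]


def detect_schema(columns, schemas=SCHEMAS):
    """Pick the schema whose known fields are best covered by `columns`.

    Returns None when no schema shares any field with the input.
    """
    present = set(columns)

    def coverage(schema):
        known = schema.required | schema.optional
        return len(present & known) / len(known)

    best = max(schemas, key=coverage)
    return best if coverage(best) > 0 else None


def missing_required(columns, schema):
    """Return the required fields of `schema` that `columns` lacks, sorted."""
    return sorted(schema.required - set(columns))


if __name__ == "__main__":
    cols = ["SAMPLE_ID", "PROTEIN_CHANGE"]
    schema = detect_schema(cols)
    print(schema.name, missing_required(cols, schema))
```

In this sketch, ties between schemas are resolved by list order, so a real implementation would probably want an explicit tie-breaking rule or a minimum coverage threshold.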

New Components

mapper.py
schemas.py
validator.py

Example

Input columns:

Tumor Sample Barcode
Gene Name
Patient Identifier

Automatically mapped to:

SAMPLE_ID
HUGO_SYMBOL
PATIENT_ID
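
A minimal sketch of how a mapping like the one above might be produced, assuming each canonical field carries a small list of known aliases. A pair such as "Gene Name" and HUGO_SYMBOL shares almost no characters, so string similarity against the target name alone would not link them. The `ALIASES` table, the function names, and the cutoff of 85 are hypothetical and not taken from `mapper.py`. The sketch uses RapidFuzz's `fuzz.ratio` when the package is installed and falls back to the standard library's `difflib.SequenceMatcher` otherwise.

```python
import re
from difflib import SequenceMatcher

try:
    # Prefer RapidFuzz, which the PR names as its matching library.
    from rapidfuzz import fuzz

    def similarity(a: str, b: str) -> float:
        return fuzz.ratio(a, b)  # score in the range 0..100
except ImportError:
    # Standard-library stand-in so the sketch runs without RapidFuzz.
    def similarity(a: str, b: str) -> float:
        return 100.0 * SequenceMatcher(None, a, b).ratio()

# Hypothetical alias table: canonical field -> normalized names seen in the wild.
ALIASES = {
    "SAMPLE_ID": ["sample id", "tumor sample barcode"],
    "HUGO_SYMBOL": ["hugo symbol", "gene name"],
    "PATIENT_ID": ["patient id", "patient identifier"],
}


def normalize(name: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def map_column(name: str, aliases=ALIASES, cutoff: float = 85.0):
    """Return the canonical field whose alias best matches `name`, or None."""
    query = normalize(name)
    best_field, best_score = None, 0.0
    for field, names in aliases.items():
        for alias in names:
            score = similarity(query, alias)
            if score > best_score:
                best_field, best_score = field, score
    return best_field if best_score >= cutoff else None


def map_columns(columns):
    """Map each input column name to a canonical field (None if unmatched)."""
    return {col: map_column(col) for col in columns}


if __name__ == "__main__":
    print(map_columns(["Tumor Sample Barcode", "Gene Name", "Patient Identifier"]))
```

Leaving unmatched columns as `None` instead of forcing a best guess keeps low-confidence mappings visible to the curator.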

Goal

This contribution moves the formatter toward a reusable schema-driven supplemental data curation pipeline and reduces manual preprocessing effort for curators.

Arsh abbasi added 2 commits May 18, 2026 18:03

Add dataframe normalization layer for cBioPortal column mapping

baf48e1

Add schema-aware intelligent column mapping pipeline

ef3c934

arsh0198 wants to merge 2 commits into sbabyanusha:main from arsh0198:add-normalization-layer
